28 research outputs found

    Técnicas big data para el procesamiento de flujos de datos masivos en tiempo real

    Get PDF
    Doctoral Program in Biotechnology, Engineering and Chemical Technology. Research Line: Engineering, Data Science and Bioinformatics. Program Code: DBI. Line Code: 111.
    Machine learning techniques have become one of the resources most demanded by companies, owing to the large volume of data that surrounds us these days. The main objective of these technologies is to solve complex problems in an automated way using data. One current perspective of machine learning is the analysis of continuous flows of data, or data streaming. This approach is increasingly requested by enterprises as a result of the large number of information sources producing time-indexed data at high frequency, such as sensors, Internet of Things devices, and social networks. Nowadays, however, research focuses more on the study of historical data than on data received in streaming. One of the main reasons for this is the enormous challenge this type of data poses for the modeling of machine learning algorithms. This Doctoral Thesis is presented as a compendium of publications comprising 10 scientific contributions in international conferences and journals with high impact indices in the Journal Citation Reports (JCR). The research developed during the PhD program focuses on the study and analysis of real-time or streaming data through the development of new machine learning algorithms. Machine learning algorithms for real-time data require a different type of modeling from the traditional one: the model is updated online to provide accurate responses in the shortest possible time. The main objective of this Doctoral Thesis is to contribute research value to the scientific community through three new machine learning algorithms. These algorithms are big data techniques, and two of them work with online or streaming data. In this way, contributions are made to the development of one of the current trends in Artificial Intelligence.
    With this purpose, algorithms are developed for descriptive and predictive tasks, i.e., unsupervised and supervised learning, respectively. Their common idea is the discovery of patterns in the data. The first technique developed during the dissertation is a triclustering algorithm that produces three-dimensional data clusters in offline or batch mode. This big data algorithm is called bigTriGen. In general terms, an evolutionary metaheuristic is used to search for groups of data with similar patterns. The model applies genetic operators such as selection, crossover, mutation and evaluation at each iteration. The goal of bigTriGen is to optimize the evaluation function to achieve triclusters of the highest possible quality. It serves as the basis for the second technique implemented during the Doctoral Thesis. The second algorithm focuses on creating groups over three-dimensional data received in real time, or in streaming, and is called STriGen. Streaming modeling starts from an offline or batch model built on historical data. As soon as this model is created, it starts receiving data in real time. The model is updated in an online or streaming manner to adapt to new streaming patterns. In this way, STriGen is able to detect concept drifts and incorporate them into the model as quickly as possible, producing triclusters in real time and of good quality. The last algorithm developed in this dissertation follows a supervised learning approach for time series forecasting in real time, and is called StreamWNN. A model is created from historical data based on the k-nearest neighbors (KNN) algorithm. Once the model is created, data starts to be received in real time. The algorithm provides real-time predictions of future data, keeping the model updated in an incremental way and incorporating streaming patterns identified as novelties.
    StreamWNN also identifies anomalous data in real time, a feature that can be used as a security measure during its application. The developed algorithms have been evaluated with real data from devices and sensors. These new techniques have proven to be very useful, providing meaningful triclusters and accurate predictions in real time.
    Universidad Pablo de Olavide de Sevilla. Departamento de Deporte e Informática

    High-Content Screening images streaming analysis using the STriGen methodology

    Get PDF
    One of the techniques that provides systematic insights into biological processes is High-Content Screening (HCS), which measures cell phenotypes simultaneously. When analysing these images, features such as fluorescent colour, shape, spatial distribution and interaction between components can be found. STriGen, which works in a real-time environment, makes it possible to study the time evolution of these features in real time. In addition, data streaming algorithms are able to process flows of data quickly. In this article, the STriGen (Streaming Triclustering Genetic) algorithm is presented and applied to HCS images. Results have proved that STriGen finds quality triclusters in HCS images, adapts correctly over time and is faster than re-computing the triclustering algorithm each time a new data stream image arrives.
    Ministerio de Economía y Competitividad TIN2017-88209-C2-1-R, TIN2017-88209-C2-2-

    Discovering three-dimensional patterns in real-time from data streams: An online triclustering approach

    Get PDF
    Triclustering algorithms group sets of coordinates of 3-dimensional datasets. In this paper, a new triclustering approach for data streams is introduced. It follows a streaming scheme of learning in two steps: an offline phase and an online phase. First, the offline phase provides a summary model with the components of the triclusters. Then, the online phase deals with data in streaming, using the summary model obtained in the offline stage to update the triclusters as quickly as possible with genetic operators. Results using three types of synthetic datasets and a real-world environmental sensor dataset are reported. The performance of the proposed streaming triclustering algorithm is compared to a batch triclustering algorithm, showing accurate performance both in terms of quality and running times.
    Ministerio de Ciencia, Innovación y Universidades TIN2017-88209-C
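
    The offline/online scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the published algorithm: the evolutionary search is reduced to single-index mutation with greedy acceptance, and the `msr` coherence measure (mean squared residue), the sliding window, and all parameter names are assumptions.

```python
import numpy as np

def msr(window, rows, cols):
    """Mean squared residue: coherence of a tricluster over the time window."""
    sub = window[np.ix_(np.arange(window.shape[0]), rows, cols)]
    return float(np.mean((sub - sub.mean()) ** 2))

def offline_phase(history, k=3, generations=50, rng=None):
    """Batch phase: search for a (rows, cols) pair with low MSR over the
    historical slices. Stand-in for the genetic search: mutate one index,
    keep the candidate only if the evaluation improves."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_rows, n_cols = history.shape[1], history.shape[2]
    rows = list(rng.choice(n_rows, k, replace=False))
    cols = list(rng.choice(n_cols, k, replace=False))
    best = msr(history, rows, cols)
    for _ in range(generations):
        cand_rows, cand_cols = rows[:], cols[:]
        if rng.random() < 0.5:
            idx = int(rng.integers(n_rows))
            if idx not in cand_rows:
                cand_rows[int(rng.integers(k))] = idx
        else:
            idx = int(rng.integers(n_cols))
            if idx not in cand_cols:
                cand_cols[int(rng.integers(k))] = idx
        score = msr(history, cand_rows, cand_cols)
        if score < best:
            rows, cols, best = cand_rows, cand_cols, score
    return rows, cols

def online_update(window, new_slice, rows, cols, threshold=1.0):
    """Streaming phase: slide the window forward one slice; if coherence
    degrades past the threshold (a possible concept drift), re-run a short
    search on the current window instead of a full batch re-computation."""
    window = np.concatenate([window[1:], new_slice[None]])
    if msr(window, rows, cols) > threshold:
        rows, cols = offline_phase(window, k=len(rows), generations=20)
    return window, rows, cols
```

    The point of the summary model is visible in `online_update`: a new slice normally costs one MSR evaluation, and the expensive search only runs when drift is detected.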

    Nearest Neighbors-Based Forecasting for Electricity Demand Time Series in Streaming

    Get PDF
    This paper presents a new forecasting algorithm for time series in streaming, named StreamWNN. The methodology has two well-differentiated stages. In the batch phase, the algorithm searches for the nearest neighbors to generate an initial prediction model. Then, an online phase is carried out as the time series arrives in streaming: the nearest neighbor in the training set of the incoming streaming data is computed, and the nearest neighbors of that training point, previously computed in the batch phase, are used to obtain the predictions. Results using electricity consumption time series are reported, showing a remarkable performance of the proposed algorithm in terms of forecasting errors when compared to a nearest neighbors-based benchmark algorithm. The running times for the predictions are also remarkable.
    Ministerio de Ciencia, Innovación y Universidades TIN2017-88209-C
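
    A minimal sketch of this two-stage nearest-neighbors scheme, assuming Euclidean distance over fixed-length windows and a plain average of the neighbors' one-step-ahead values; the function names and parameters are illustrative, not the published StreamWNN implementation.

```python
import numpy as np

def build_batch_model(series, w=10, k=3):
    """Batch phase: window the historical series and precompute, for every
    training window, its k nearest training windows (Euclidean distance)."""
    n = len(series) - w
    windows = np.array([series[i:i + w] for i in range(n)])
    targets = np.array([series[i + w] for i in range(n)])  # one-step-ahead value
    dists = np.linalg.norm(windows[:, None, :] - windows[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)        # a window is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    return windows, targets, neighbors

def predict_stream(live_window, model):
    """Online phase: locate the training window closest to the live window,
    then average the next values of that window's precomputed neighbors."""
    windows, targets, neighbors = model
    nn = int(np.argmin(np.linalg.norm(windows - live_window, axis=1)))
    return float(targets[neighbors[nn]].mean())
```

    The online step is cheap because only one distance scan against the training windows is needed per prediction; the neighbor lists were already paid for in the batch phase.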

    Coronavirus Optimization Algorithm: A Bioinspired Metaheuristic Based on the COVID-19 Propagation Model

    Get PDF
    This study proposes a novel bioinspired metaheuristic simulating how the coronavirus spreads and infects healthy people. From a primary infected individual (patient zero), the coronavirus rapidly infects new victims, creating large populations of infected people who will either die or spread infection. Relevant terms such as the reinfection probability, super-spreading rate, social distancing measures, or traveling rate are introduced into the model to simulate the coronavirus activity as accurately as possible. The infected population initially grows exponentially over time, but taking into consideration social isolation measures, the mortality rate, and the number of recoveries, it gradually decreases. The coronavirus optimization algorithm has two major advantages when compared with other similar strategies. First, the input parameters are already set according to the disease statistics, preventing researchers from initializing them with arbitrary values. Second, the approach is able to end after several iterations, without setting this value either. Furthermore, a parallel multivirus version is proposed, where several coronavirus strains evolve over time and explore wider search space areas in fewer iterations. Finally, the metaheuristic has been combined with deep learning models to find optimal hyperparameters during the training phase. As an application case, the problem of electricity load time series forecasting has been addressed, showing quite remarkable performance.
    Ministerio de Economía y Competitividad TIN2017-88209-C
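
    A toy sketch of the epidemic metaphor as a search procedure, here minimising a test function. This illustrates the idea rather than the published Coronavirus Optimization Algorithm: the Gaussian perturbation scale, the isolation cap on the infected population, and all parameter values are assumptions, not the disease-derived settings the paper uses.

```python
import random

def cvoa_minimize(f, dim=2, bounds=(-5.0, 5.0), max_time=30,
                  spread_rate=3, death_prob=0.1, cap=200, seed=0):
    """Epidemic-style search: patient zero infects perturbed copies of itself;
    dead solutions spread nothing, recovered ones cannot be reinfected, and an
    isolation cap limits the infected population. The run ends on its own when
    nobody is left infected, mirroring the paper's natural stopping criterion."""
    rng = random.Random(seed)
    lo, hi = bounds
    patient_zero = tuple(rng.uniform(lo, hi) for _ in range(dim))
    infected, recovered = {patient_zero}, set()
    best = patient_zero
    for _ in range(max_time):
        new_infected = set()
        for sol in infected:
            recovered.add(sol)                    # cannot be reinfected later
            if rng.random() < death_prob:         # dies: infects no one
                continue
            for _ in range(spread_rate):          # spreads to perturbed copies
                cand = tuple(min(hi, max(lo, x + rng.gauss(0.0, 0.5)))
                             for x in sol)
                if cand not in recovered:
                    new_infected.add(cand)
                    if f(cand) < f(best):
                        best = cand
        # isolation measure: keep only the fittest infected individuals
        infected = set(sorted(new_infected, key=f)[:cap])
        if not infected:                          # epidemic died out: stop
            break
    return best, f(best)
```

    Exponential growth (`spread_rate` copies per infected solution) drives exploration early on, while deaths, the recovered set, and the cap shrink the population, which is what lets the algorithm terminate without an explicit iteration budget.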

    ERK5/BMK1 is a novel target of the tumor suppressor VHL: implication in clear cell renal carcinoma

    Get PDF
    There are four pages of unnumbered supplementary material.
    Extracellular signal-regulated kinase 5 (ERK5), also known as big mitogen-activated protein kinase (MAPK) 1, is implicated in a wide range of biologic processes, including proliferation and vascularization. Here, we show that ERK5 is degraded through the ubiquitin-proteasome system, in a process mediated by the tumor suppressor von Hippel-Lindau (VHL) gene, through a prolyl hydroxylation-dependent mechanism. Our conclusions derive from transient transfection assays in Cos7 cells, as well as the study of endogenous ERK5 in different experimental systems such as MCF7, HMEC, or Caki-2 cell lines. In fact, the specific knockdown of ERK5 in pVHL-negative cell lines promotes a decrease in proliferation and migration, supporting the role of this MAPK in cellular transformation. Furthermore, in a short series of fresh samples from human clear cell renal cell carcinoma, high levels of ERK5 correlate with more aggressive and metastatic stages of the disease. Therefore, our results provide new biochemical data suggesting that ERK5 is a novel target of the tumor suppressor VHL, opening a new field of research on the role of ERK5 in renal carcinomas.

    A novel semantic segmentation approach based on U-Net, WU-Net, and U-Net++ deep learning for predicting areas sensitive to pluvial flood at tropical area

    No full text
    Floods remain one of the most devastating weather-induced disasters worldwide, resulting in numerous fatalities each year and severely impacting socio-economic development and the environment. Therefore, the ability to predict flood-prone areas in advance is crucial for effective risk management. The objective of this research is to assess and compare three convolutional neural networks, U-Net, WU-Net, and U-Net++, for spatial prediction of pluvial floods, with a case study in a tropical area in the north of Vietnam. These are relatively new convolutional networks based on U-shaped architectures. For this task, a geospatial database with 796 historical flood locations and 12 flood indicators was prepared. For training the models, binary cross-entropy was employed as the loss function and the Adaptive Moment Estimation (ADAM) algorithm was used to optimize the model parameters, while the F1-score and classification accuracy (Acc) were used to assess model performance. The results unequivocally highlight the high performance of the three models, which achieve an impressive accuracy rate of 96.01%. The flood susceptibility maps derived from this research possess considerable utility for local authorities, providing valuable insights and information to enhance decision-making processes and facilitate the implementation of effective risk management strategies.
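
    The loss and the evaluation metrics named above can be written out directly. A minimal NumPy sketch, assuming binary flood / no-flood masks; the 0.5 threshold and the helper names are illustrative, not taken from the paper.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Pixel-wise binary cross-entropy, the loss used to train the networks."""
    p = np.clip(y_prob, eps, 1 - eps)          # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def f1_and_accuracy(y_true, y_prob, threshold=0.5):
    """F1-score and classification accuracy on a binary flood mask."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    y_true = np.asarray(y_true)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    acc = float(np.mean(y_pred == y_true))
    return f1, acc
```

    F1 matters alongside accuracy here because flooded pixels are typically a small minority of the map, so accuracy alone can look high even when floods are missed.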

    Generating a seismogenic source zone model for the Pyrenees: A GIS-assisted triclustering approach

    No full text
    Seismogenic source zone models, including their delineation and characterization, still have a role to play in seismic hazard calculations, particularly in regions with low to moderate seismicity. Seismic source zones establish areas with common tectonic and seismic characteristics, described by a unique magnitude–frequency distribution. Their definition can be addressed from different perspectives. Traditionally, source zones have been geographically outlined from seismotectonics, geological structures, and earthquake catalogs. Geographic information systems (GIS) can be of great help in their definition, as they deal rigorously and less ambiguously with the available geographical data. Moreover, novel computer science approaches are now being employed in their definition. The Pyrenees mountain range, in southwest Europe, is located in a region characterized by low to moderate seismicity. In this study, a method based purely on seismic catalogs, managed with a GIS and a triclustering algorithm, was used to delineate seismogenic zones in the Pyrenees. Based on an updated, reviewed, declustered, extensive, and homogeneous earthquake catalog (including detailed information about each event, such as date and time, hypocentral location, and size), a triclustering algorithm was applied to generate the seismogenic zones. The method seeks seismicity patterns in a quasi-objective manner, following an initial assessment of the best-suited seismic parameters. The eight zones identified in this study are represented on maps for analysis, with the zone extending from the Arudy–Arette region to Bagnères de Bigorre showing the highest seismic hazard potential.
    European Commission (EC) 0313-PERSISTAH; Ministerio de Economía y Competitividad TIN2017-88209-C2; Junta de Andalucía US-126334

    A new big data triclustering approach for extracting three-dimensional patterns in precision agriculture

    Get PDF
    Funding Information: The authors would like to thank the Spanish Ministry of Science and Innovation for support under project PID2020-117954RB, and the European Regional Development Fund and Junta de Andalucía for projects PY20-00870 and UPO-138516. This work could not have been done without the support and help of the Farmer's Association of Baixo Alentejo and Francisco Palma during the whole project. Finally, the authors thank António Vieira Lima and Moragri S. A. for giving access to data. Publisher Copyright: © 2022.
    Precision agriculture focuses on the development of site-specific harvesting that considers the variability of each crop area. Vegetation indices allow the study and delineation of different characteristics of each field zone, generally invisible to the naked eye. This paper introduces a new big data triclustering approach based on evolutionary algorithms. The algorithm shows its capability to discover three-dimensional patterns on the basis of vegetation indices from vine crops. Different vegetation indices have been tested to find different patterns in the crops. The results, reported using a vineyard crop located in Portugal, depict four areas with different moisture stress particularities that can lead to changes in the management of the vineyard. Furthermore, scalability studies have been performed, showing that the proposed algorithm is suitable for dealing with big datasets.
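
    Vegetation indices of the kind used here are simple per-pixel band arithmetic. As one common example (the abstract does not name the specific indices, so NDVI is an assumption), computed from near-infrared and red reflectance:

```python
import numpy as np

def ndvi(nir, red, eps=1e-9):
    """Normalized Difference Vegetation Index, pixel-wise: (NIR - R) / (NIR + R).
    Values near 1 indicate dense healthy vegetation; values near 0, bare soil."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + eps)   # eps guards against division by zero
```

    Stacking such index maps over acquisition dates yields a time × pixel-row × pixel-column array, which is exactly the kind of three-dimensional input a triclustering algorithm searches for coherent patterns.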

    Mercury content in tinned mussels, common cockles and razor shells commercialized in Galicia (Spain)

    No full text
    Molluscs accumulate heavy metals and pose a health hazard to consumers. Mercury is considered a non-essential and highly toxic metal. Mercury concentrations were determined in tinned mussels, common cockles and razor shells commercialized in Galicia (Spain). After microwave-assisted digestion of the samples in an acid medium, an anodic stripping voltammetric technique using a gold disc as working electrode was used to obtain the metal concentrations in muscle, liver and covering liquid. Results for complete molluscs, expressed in ppm of fresh weight, showed the highest levels in mussels (0.27), followed by common cockles (0.20) and finally razor shells (0.08). Across the portions, mercury contents were: liver > covering liquid > muscle. All observed mercury concentrations were below the maximum limit permitted for human consumption; therefore, consumption does not constitute a risk to consumer health.
    Xunta de Galicia funded project PGIDIT02TAL26101PRS